Abstract. Text mining has been used for various purposes, such as document classification and
extraction of domain-specific information from text. In this paper we present a study in which text
mining methodology and algorithms were employed for academic dishonesty (cheating)
detection and evaluation on open-ended college exams, based on document classification
techniques. First, we propose two classification models for cheating detection using a supervised
decision tree algorithm. Then, both classifiers are compared against the result produced by a domain
expert. The results show that one of the classifiers achieved excellent quality in detecting
and evaluating cheating in exams, making its use possible in real school and college environments.

Keywords: architectures for educational technology system, evaluation methodologies, improving 
classroom teaching, pedagogical issues 


1. Introduction 

In a world where most of the corporate data is available in textual format, text mining has 
emerged as a powerful tool to support knowledge management. Considered as a branch 
of data mining, the purpose of text mining is to find patterns, tendencies and regularities 
in documents written in natural language (Feldman and Sanger, 2007). Examples of text 
mining applications include: extraction of domain-specific information from text, email 
filtering, search engines, and document categorization (Berry, 2004). 

Although data and text mining applications are commonly employed for industrial and
commercial purposes, they can also be used for educational aims. Most related work
focuses on e-learning environments (Romero et al., 2008; Delavari et al., 2008; Lin et al.,
2009) and/or plagiarism detection (Adeva et al., 2006; Sorokina et al., 2006; Butakov
and Scherbinin, 2008). This work, however, addresses another practical application of
text mining in the education domain: detection and evaluation of academic dishonesty
(cheating) on written school exams.

According to research conducted at Brazilian public universities and schools, cheating
is extremely common in academic environments (da Silva et al., 2006; Silva et al.,
2009). This practice is an old problem without a concrete solution (Rangel, 2001).
There is no precise definition of cheating, but it is assumed to occur whenever two or
more exams show a certain degree of similarity in their answers. The extent of cheating
varies: it can involve part of a question, a whole question, several questions, or the
whole exam. In addition, cheating can be blatant (e.g., copy-paste) or subtle (e.g., a
partial copy-paste).

The practice of cheating is present all over the world, in all segments of education,
from grade school to graduate school (Davis et al., 2009; Guthrie, 2009). Efforts have
been made to find ways to prevent students from cheating (Guthrie, 2009;
Broeckelman-Post, 2008) or even to predict when a student is likely to cheat (Passow
et al., 2006; Kremmer et al., 2007).

Besides prevention and prediction techniques, it is also possible to use computer
programs to detect cheating on exams. Most papers in this area propose statistical
techniques to detect cheating on multiple-choice tests or exams (McManus et al.,
2005; Sotaridona et al., 2006; van der Ark et al., 2008; DiSario et al., 2009). In
contrast, in this paper we show how text mining algorithms can be combined as a promising
technique not only to detect but also to evaluate cheating on open-ended exams. To the
best of our knowledge, this is the first work that shows how to use text mining
to develop a solution that detects and evaluates cheating on school exams.

The rest of this paper is organized as follows. The related work concerning plagiarism
and cheating detection is discussed in Section 2. A background of text mining concepts
is presented in Section 3. In Section 4, we describe a case study performed at a Federal
University in Brazil, where a supervised classification algorithm was employed to create
inference models capable of detecting the presence and level of cheating in a real set of
school exams. Section 5 presents the evaluation of the models, comparing them against a
model produced by a human specialist. Section 6 offers an analysis of the results. Finally,
our conclusions and suggestions for further work are presented in Section 7.


2. Related Work 

A problem that is pedagogically similar to cheating on school exams is plagiarizing
academic work. Plagiarism is an act of fraud that involves both stealing someone else’s work
and lying about it afterward. Plagiarism usually occurs in academia where documents 
are typically essays or reports. However, plagiarism is also widely present in scientific 
papers, art designs, and program source code. 

The widespread use of computers and the advent of the Internet have made it easier to 
plagiarize others’ work. Students are less likely to commit plagiarism if they know that 
their work will be checked by a plagiarism detection application. Plagiarism detection is 
the process of locating instances of plagiarism within a work or document. Our related 
work emphasizes industry and academic solutions for plagiarism detection. 
Table 1

Some commercial and free plagiarism tools

| Name | License | Description |
|---|---|---|
| Ephorus¹ | Proprietary | Web-based application used to prevent and detect plagiarism in school work. The user can upload documents to be checked for similarities against Internet sources and other student papers uploaded by instructors. As a result, the application returns a report containing the similarities between the submitted document and the sources found. |
| Plagium² | Proprietary | Web-based application that checks whether the content of a website or research paper has been copied and used elsewhere. It works similarly to a search engine. However, unlike Google or Yahoo, which often impose a limit of 10-12 keywords per search, the application accepts much larger blocks of text for searching online. Plagium breaks up the input text into smaller “snippets”. These snippets are matched against Web content, with the matches scored to determine which documents match the input text. |
| Sherlock³ | Free | It uses digital signatures to find similar pieces of text. Sherlock works on text files such as essays, computer source code files, and other assignments in digital form. The program output offers the percentage of similarity between each pair of documents in the set of documents provided as input. |
| Urkund⁴ | Free | It checks a document against three central sources: the Internet, published materials, and materials previously submitted by students, e.g., memos, case studies, and degree work (theses/dissertations). The system highlights the parts of a document that disclose similarities with the three sources. A percentage indication for each hit in the document is offered as output. It is then up to a tutor to decide whether this should be regarded as a piece of plagiarism. |

1 http://www.ephorus.pt, 2 http://www.plagium.com,
3 White and Joy (2004), 4 http://www.urkund.com.


2.1. Tools 

There are several commercial and free online applications for text-plagiarism detection.
A short description of some of them is presented in Table 1. Most of them use a
web-based architecture, checking whether a given document is similar to others available
online. This setting, however, differs from the task of detecting cheating on school exams,
where the copying occurs locally, i.e., at a physical location.

2.2. Academic Papers 

Many studies in the literature deal with the plagiarism problem. Lukashenko et al.
(2007) present a survey of methods and applications to detect plagiarism. More recent
articles (Barron-Cedeno and Rosso, 2009; Butakov and Scherbinin, 2008) propose new
techniques to deal with the plagiarism problem.

Concerning the practice of cheating, studies point out that it is a habit present all
over the world, in all segments of education, from elementary school to graduate school
(Davis et al., 2009; Guthrie, 2009). Many efforts have been made to find ways of preventing
students from cheating (Guthrie, 2009; Broeckelman-Post, 2008) or even of predicting when
a student is likely to cheat (Passow et al., 2006; Kremmer et al., 2007).

Besides the techniques applied to prevent cheating, it is also possible to use computer
programs to detect cheating on school work and exams. Most articles in this area
propose statistical techniques to detect cheating on multiple-choice exams
(McManus et al., 2005; Sotaridona et al., 2006; van der Ark et al., 2008; DiSario et al.,
2009). In this paper we show instead how text mining algorithms can be employed to
detect and evaluate cheating in exams based on open-ended questions.


3. Background 

Data mining usually deals with structured data, i.e., data stored in a well-defined format 
such as worksheets and databases (Tan et al., 2005). Text mining is considered a type of 
data mining that deals with non-structured data (Feldman and Sanger, 2007). Information 
Retrieval as well as supervised and non-supervised classification of documents are some 
of the research areas in which text mining is applied. 

Classification can be defined as the task of assigning objects to one of several
predefined categories (also known as class labels; Tan et al., 2005). The classification
is said to be supervised when the class labels of the training objects are already known.
Non-supervised classification, on the other hand, is used when this information is missing.

A representation of the classification task is shown in Fig. 1, where the input x is the
set of attributes of an object and the output y is the class label that informs the class of
that object. A classification model can be used in two ways, either as a descriptive model or
as a predictive model. The former can serve as an explanatory tool to distinguish between
objects of different classes. The latter can be used to predict the class label of unknown
data. Examples of classification models include: decision tree classifiers, k-nearest
neighbors, neural networks, support vector machines, rule-based classifiers, and naive Bayes
classifiers (Witten and Frank, 2005).

3.1. Document Representation 

Due to the non-structured nature of text documents, an essential task executed at the
preprocessing step of the text mining process is to assign some structure to the content stored
in the documents (Feldman and Sanger, 2007). This task ensures that documents can be
better handled by knowledge extraction algorithms. Although some of these algorithms
require sophisticated information, such as the ones based on linguistic knowledge, most
of the pattern extraction algorithms only require the documents to be represented in a
spreadsheet format. In such a format, denoted as bag of words, lines correspond to
documents and columns represent the terms contained in the document collection. Terms are
independent and form an unordered set in which the order of occurrence is not taken into
consideration. One possibility to represent a bag of words is using attribute-value tables
(Berry, 2004).

Input: attribute set (x) -> Classification model -> Output: class label (y)

Fig. 1. Classification as the task of mapping an input x into its class label y (Tan et al., 2005).

An example of such a representation is illustrated in Table 2, where $d_i$ corresponds to
the $i$th document, $t_j$ represents the $j$th attribute (term), $a_{ij}$ is the measure that
relates $d_i$ and $t_j$, and $y_i$ represents the class (or label) in which the document is classified.

According to Table 2, each document can be represented as a vector $d_i = (a_i, y_i)$,
where $a_i = (a_{i1}, a_{i2}, \ldots, a_{iM})$ and $y_i$ represents the class of the document. Several
measures have been proposed to compute the values of $a_{ij}$. These measures are classified
into two types: binary and frequency-based. Binary measures indicate the occurrence (or
not) of a term in a certain document. They can be used to extract information about the
similarity of documents considering the number of terms in common.
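The difference between the two kinds of measure can be illustrated with a minimal Python sketch. The function names and the two toy documents are hypothetical and chosen only for illustration; they are not part of the case study.

```python
from collections import Counter

def build_vocabulary(docs):
    """Return the sorted list of distinct terms across all tokenized documents."""
    return sorted({term for doc in docs for term in doc})

def binary_vector(doc, vocabulary):
    """Binary measure: 1 if the term occurs in the document, 0 otherwise."""
    present = set(doc)
    return [1 if term in present else 0 for term in vocabulary]

def count_vector(doc, vocabulary):
    """Frequency-based measure: absolute number of occurrences of each term."""
    counts = Counter(doc)
    return [counts[term] for term in vocabulary]

# Two toy documents, already tokenized (hypothetical data).
docs = [
    ["market", "price", "market"],
    ["price", "product"],
]
vocab = build_vocabulary(docs)
binary_rows = [binary_vector(d, vocab) for d in docs]
count_rows = [count_vector(d, vocab) for d in docs]
```

Each row of `binary_rows` or `count_rows` plays the role of one vector $a_i$ in Table 2, with one column per term of the vocabulary.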

Frequency-based measures aim at counting the occurrences of a certain term in a given 
document. They can be used for instance to extract statistical measures in the extraction 
of patterns. Among the frequency-based measures, it is possible to distinguish two other 
groups: supervised measures, which depend on the availability of data with a well-known 
class value (last column of Table 2), measuring the importance of a certain attribute to 
determine the class value; and non-supervised measures which are applicable to non- 
labelled data. 

ConfWeight (Soucy and Mineau, 2005) and Mutual Information (Berry, 2004) are
examples of supervised measures. As examples of non-supervised measures we have TF
(term frequency), which considers the absolute frequency of terms in documents
(Rijsbergen, 1979); IDF (inverse document frequency) (Salton et al., 1975), which computes
the inverse frequency of a term, favoring those terms that appear in few documents of the
collection; and TF-IDF (Salton and Buckley, 1988), which combines the
two previous measures (TF and IDF).


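Table 2

Documents as a vector representation

|       | $t_1$    | $t_2$    | ... | $t_M$    | Class |
|-------|----------|----------|-----|----------|-------|
| $d_1$ | $a_{11}$ | $a_{12}$ | ... | $a_{1M}$ | $y_1$ |
| $d_2$ | $a_{21}$ | $a_{22}$ | ... | $a_{2M}$ | $y_2$ |
| ...   | ...      | ...      | ... | ...      | ...   |
| $d_N$ | $a_{N1}$ | $a_{N2}$ | ... | $a_{NM}$ | $y_N$ |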
3.2. Similarity Between Documents


A common way to check whether two documents are similar is to compare the terms (words)
contained in both documents. Additionally, it is necessary to take into account the frequency
of each term in the document. This measure is called term frequency (TF). However, due to the
high occurrence of some kinds of terms (e.g., articles or prepositions), the inverse
document frequency (IDF) is used to weight the frequency of terms. As a result,
frequent terms have a lower weight than unusual terms. This method is denoted as TF-IDF
(term frequency - inverse document frequency). It was proposed by Salton and Buckley
(1988) and is commonly used in Information Retrieval (Soucy and Mineau, 2005).

Formally, the frequency of a term $i$ that appears in a document $d_j$ is:

$$\mathrm{TF}_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}} \qquad (1)$$

where $n_{i,j}$ is the number of occurrences of the term $i$ in document $d_j$ and the denominator
is the sum of occurrences of all terms in $d_j$. Given that $N$ is the total number of documents,
the formula that computes the inverse document frequency (IDF) is:

$$\mathrm{IDF}_i = \log \frac{N}{|\{d : t_i \in d\}|} \qquad (2)$$

where $|\{d : t_i \in d\}|$ represents the number of documents in which the term $t_i$ appears. In
this sense, the value of TF-IDF for a term $i$ in a document $j$ is:

$$\mathrm{TFIDF}_{i,j} = \mathrm{TF}_{i,j} \times \mathrm{IDF}_i \qquad (3)$$

The computational cost of the method TF-IDF is O(NM), where N is the number of 
documents and M is the number of terms (see Table 2). 
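The TF, IDF, and TF-IDF definitions of this section can be sketched in a few lines of Python. This is only an illustrative sketch: the function names and sample documents are hypothetical, and the natural logarithm is an assumption, since the text does not state the base of the logarithm.

```python
import math
from collections import Counter

def tf(doc):
    """Relative frequency of each term within one tokenized document."""
    counts = Counter(doc)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}

def idf(docs):
    """log(N / number of documents containing the term); natural log assumed."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}

def tf_idf(docs):
    """One {term: weight} dictionary per document, weight = TF x IDF."""
    idf_weights = idf(docs)
    return [
        {term: weight * idf_weights[term] for term, weight in tf(doc).items()}
        for doc in docs
    ]

# Hypothetical tokenized documents.
docs = [
    ["market", "price", "market"],
    ["price", "product"],
]
weights = tf_idf(docs)
```

In this toy collection the term "price" appears in every document, so its IDF (and therefore its TF-IDF weight) is zero, which is the down-weighting of frequent terms described above. The nested loops over documents and terms reflect the O(NM) cost.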

As discussed in Section 3.1, a document is represented as a vector
$d_i = (a_{i1}, a_{i2}, \ldots, a_{iM})$, where each component $a_{ij}$ is calculated according to the
TF-IDF method. The similarity between two documents $D_1$ and $D_2$ is determined by the cosine
between the two vectors (4).

$$\mathrm{Cosine}(D_1, D_2) = \frac{D_1 \cdot D_2}{|D_1|\,|D_2|} \qquad (4)$$

where $D_1 \cdot D_2$ represents the scalar product of the vectors, whilst $|D_1|$ and $|D_2|$ represent
the magnitudes of the vectors.

Because TF-IDF weights are non-negative, the cosine similarity value varies between 0
(minimum) and 1 (maximum). The first value implies that the two documents are totally
different, and the second that they are completely similar. The cosine similarity method is
considered a standard measure in text mining research (Berry, 2004; Weiss et al., 2005).
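As an illustration, the cosine similarity between two weight vectors of equal length can be computed as follows. The function name is hypothetical, and returning 0.0 for an all-zero vector is an assumption of this sketch, since the formula is undefined in that case.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Assumption: a document with no weighted terms is similar to nothing.
        return 0.0
    return dot / (norm_u * norm_v)

no_shared_terms = cosine([1.0, 0.0], [0.0, 1.0])
identical = cosine([3.0, 4.0], [3.0, 4.0])
```

Vectors with no term in common yield 0, while identical (or proportional) vectors yield 1.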

Another text similarity metric is the overlap coefficient, derived from the Jaccard
coefficient (Berry, 2004). To compute this metric, instead of using the document as a vector
we make use of the document itself, which can be viewed as a set of words. The overlap
between two documents/sets $D_1$ and $D_2$ is equal to the size of the intersection between the
two sets divided by the size of the smaller one (5). Like the cosine similarity, the value ranges
from 0 (minimum) to 1 (maximum). Similarly, 0 indicates no document similarity, and 1
maximum similarity. Examples of open source libraries containing text similarity metrics
include SimMetrics (Chapman, 2004) and SecondString (Cohen et al., 2003).

$$\mathrm{Overlap}(D_1, D_2) = \frac{|D_1 \cap D_2|}{\min(|D_1|, |D_2|)} \qquad (5)$$
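A minimal Python sketch of the overlap coefficient, treating each document as a set of words, is given below. The function name is hypothetical, and returning 0.0 when one of the documents is empty is an assumption of this sketch.

```python
def overlap(doc_a, doc_b):
    """Overlap coefficient between two tokenized documents viewed as sets."""
    set_a, set_b = set(doc_a), set(doc_b)
    smaller = min(len(set_a), len(set_b))
    if smaller == 0:
        # Assumption: an empty answer overlaps with nothing.
        return 0.0
    return len(set_a & set_b) / smaller

partial = overlap(["a", "b", "c"], ["b", "c", "d", "e"])
subset = overlap(["a", "b"], ["a", "b", "c"])
```

Because the denominator is the size of the smaller set, an answer whose words are all contained in a longer answer reaches the maximum value of 1, which is one reason the measure is sensitive to partial copying.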


3.3. Quality Metrics 

In the text mining literature, there are several metrics that quantify and qualify
predictive models (e.g., supervised classification and regression). Table 3 presents the number
of correct classifications in contrast with predicted classifications for the classes ‘+’ and
‘-’ of a binary model. This table, called a confusion matrix, enables the computation of
the following metrics: accuracy, precision, and recall.

The recall of a class is defined as the ratio between the number of correctly classified 
documents and all documents belonging to the class. Precision is the ratio between the 
number of correctly classified documents and all documents considered by the model as 
belonging to the class (Feldman and Sanger, 2007). 

While the previous metrics are calculated for each class of the model, accuracy is a
global metric. It reflects the hit ratio, i.e., the proportion between the correctly inferred
classifications and the total of inferred classifications. Considering the example of
Table 3, we have that:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + FN + FP + TN} \qquad (6)$$

Besides the previous metrics, there is a statistical coefficient denoted Kappa index or
κ statistic (Cohen, 1960), which is a measure of agreement for nominal scales. It is largely
used in Medicine, although it has also been used to detect
answer copying in exams (Sotaridona et al., 2006). When applied to the
text mining classification task, the Kappa index indicates the level of agreement between
the model classification and a reference classification. In other words, it determines how
much the two models agree with respect to the classification.

Table 3

An example of a confusion matrix for a problem involving two classes

| Prediction | True +        | True -        | Precision     |
|------------|---------------|---------------|---------------|
| +          | TP (a)        | FP (c)        | TP/(TP + FP)  |
| -          | FN (b)        | TN (d)        | TN/(TN + FN)  |
| Recall     | TP/(TP + FN)  | TN/(FP + TN)  |               |

a True positive. b False negative. c False positive. d True negative.

Considering the confusion matrix described in Table 3, the Kappa index is calculated
according to (7). Like the cosine similarity metric, κ has a maximum of 1 (maximum
agreement), while a value of 0 indicates agreement no better than chance.

$$\kappa = \frac{n^2 \cdot \mathrm{accuracy} - [X + Y]}{n^2 - [X + Y]} \qquad (7)$$

where $n = TP + FN + FP + TN$ is the total number of classifications,
$X = (TP + FN) \cdot (TP + FP)$, $Y = (FN + TN) \cdot (FP + TN)$, and accuracy is
given by (6). The use of the previous quality metrics enables the adequate evaluation of
the cheating classification models presented in the following section.
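To make the computation concrete, the following Python sketch evaluates Cohen's Kappa for a two-class confusion matrix, using the standard chance-agreement term built from the row and column totals. The function name and the counts in the example are hypothetical.

```python
def kappa(tp, fn, fp, tn):
    """Cohen's Kappa for a 2x2 confusion matrix given as raw counts."""
    n = tp + fn + fp + tn
    accuracy = (tp + tn) / n
    x = (tp + fn) * (tp + fp)  # true '+' total times predicted '+' total
    y = (fn + tn) * (fp + tn)  # predicted '-' total times true '-' total
    chance = x + y
    if n * n == chance:
        # Happens when every item falls into a single class on both sides.
        raise ValueError("Kappa is undefined for this confusion matrix")
    return (n * n * accuracy - chance) / (n * n - chance)

# Hypothetical counts: 50 classifications, 35 of them correct.
example = kappa(tp=20, fn=5, fp=10, tn=15)
perfect = kappa(tp=10, fn=0, fp=0, tn=10)
```

With these hypothetical counts the accuracy is 0.7, but the Kappa index is noticeably lower because part of that agreement is expected by chance alone.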


4. Methodology 

To check how text mining and supervised classification techniques can be applied
together in the detection of cheating on school exams, we developed a case study. It was
performed at the Federal University of Campina Grande, Brazil, in a project involving
the Business Management and Computer Science departments. Considering that text
mining is a sub-area of data mining, the steps followed in the case study are based on the
data mining methodology proposed by Tan et al. (2005). The steps include: data selection,
preprocessing, data transformation, data mining, and analysis.

4.1. Data Selection and Preprocessing 

A set of thirty school exams written in Brazilian Portuguese was selected
for the case study. Each exam contained four open-ended questions in the area of
administration and sub-area of marketing. The exams were answered by the students and
stored in electronic format as plain text (e.g., text files). There was no need for a sampling
operation, since all the exams were used in the data mining process.

In a real-life situation, a teacher detects cheating by comparing the answer to a
question given by a student A against the answer to the same question given by
a student B. For this reason, we divided each exam into four distinct parts, each
corresponding to a different question. The answer (text) to each question was considered
the target of the text mining process. For each question, we defined a controlled
dictionary containing a set of words that could be used by students to answer the question.
When two answers to the same question contained a high number of identical
words, this was considered strong evidence of cheating.

To enable the correct application of the data mining algorithms, punctuation and
accentuation were removed from each document. In general, this task is needed to minimize
the size of document vectors (Table 2) as well as to avoid the need to distinguish words
that in fact are lexically the same (e.g., ‘elétrico’ vs. ‘eletrico’¹). Although such an
operation can bring benefits in most cases, we are aware that it can treat two lexically
different words as the same, for instance, words that differ only in their
accentuation (e.g., pelo/pélo/pêlo²)³. However, since these cases are unusual, we believe that
the operation brings more gains than losses. This task was implemented using the Java
API 1.5 and the Eclipse IDE tool (Eclipse, 2010).

1 In English, electric.

Fig. 2. Portuguese stemming. Source: adapted from Morais (2007).

4.2. Data Transformation 

After removing punctuation and accentuation, we started a tokenization process to
transform each document into a set of words or tokens. Tokens with fewer than three characters
were not considered. This enabled the removal of common grammatical elements, e.g.,
prepositions, articles, and conjunctions. It also helped to minimize the size of document
vectors and optimize the data mining algorithm.
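The removal of punctuation and accentuation (Section 4.1) and the tokenization step can be sketched in Python as shown below. This is not the Java implementation used in the case study; the function names are hypothetical, and lowercasing the text is an assumption of this sketch that the paper does not mention.

```python
import string
import unicodedata

def strip_accents(text):
    """Drop combining marks so that, e.g., 'elétrico' becomes 'eletrico'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

def tokenize(text, min_length=3):
    """Lowercase (assumption), remove accents and ASCII punctuation, split on
    whitespace, and keep only tokens with at least `min_length` characters."""
    text = strip_accents(text.lower())
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    words = text.translate(table).split()
    return [word for word in words if len(word) >= min_length]

tokens = tokenize("O circuito elétrico, de fato, é simples.")
```

In this example the short words "O", "de", and "é" are discarded by the length filter, leaving `circuito`, `eletrico`, `fato`, and `simples`.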

The next step consisted of removing irrelevant words (denoted as stopwords). To this
end, we used an adaptation of the Brazilian Portuguese stopword dictionary from the
Snowball project (Porter and Boulton, 2002).

Afterwards, a morphological normalization process (denoted as stemming) was
performed. This process consists of transforming words into primitive terms (see Fig. 2). For
instance, suppose that in the answer to question X a student A uses the phrase ‘...
processes the product to ...’, whilst student B takes a look at the exam of student A and
writes ‘... the product is processed to ...’. We can notice that the words ‘processes’ and
‘processed’ have the same stem (radical), ‘process’. The morphological normalization enables
the removal of characters referring to plural, feminine gender, augmentative, diminutive,
etc., keeping only the stem of the words. This step was implemented using the
stemming algorithm of the Snowball project (Porter and Boulton, 2002).

Documents also need to be semantically normalized. This task consists of mapping all
the synonyms of a word into a single base term. To this end, a lexical base written in the
same language as the documents can be used. Examples of lexical bases for the English
and Portuguese languages are WordNet (Miller, 1995) and WordNet.PT (Palmira et al.,
2010), respectively. After all the previous tasks, each answer of each student (document)
was transformed into a vector of words (as detailed in Section 3.1), according to the
TF-IDF method (equation (3) in Section 3.2).

2 In English, for/to strip/pelage.

3 With the new international spelling agreement for the Portuguese language, some words that
were differentiated by accentuation are now written in the same way. As an example, the words
pelo/pélo/pêlo will be written without accentuation (Cunha, 2009).

Fig. 3. Text mining process flow in the RapidMiner tool.

Since the size of the answers was small (i.e., one or two paragraphs), the use of vector
compression techniques was not required. The average size of the vectors was nearly 450
columns. In addition, pruning techniques were not applied, since their use led to worse
results during the similarity computation between documents.

All tasks related to data transformation were performed using the RapidMiner tool
(Mierswa et al., 2006; Rapid-I, 2010), an open source software for knowledge discovery,
machine learning, and data mining. Fig. 3 illustrates the tasks of the data transformation
step, highlighted by a rectangle.

4.3. Text Mining 

The data mining process involved two tasks. First, we computed the cosine similarity
and the overlap coefficient for each pair of documents (operator ExampleSet2Similarity
of Fig. 3). Then, we created a new data sheet containing the values of these two
metrics. Table 4 shows an excerpt from this data sheet. The first column, ID, specifies the
question identifier Q and the student codes, X and Y. The second and third columns
contain, respectively, the values of the cosine similarity (4) and the overlap coefficient (5)
between the answers provided by students X and Y for question Q. The last column
was filled with the cheating level identified after a traditional exam evaluation done by
the course lecturer, denoted here as the specialist. Each pair of answers is assigned one of
four levels of cheating: none, low, intermediary, or high. The cheating mapping between all
students is detailed in Table 5. Since our sample has thirty exams, and each one
contains four questions, both the cosine similarity and the overlap coefficient were computed
$4 \cdot \binom{30}{2} = 4 \cdot 435 = 1740$ times.

Table 4

Portion of the data obtained after text mining and used for the supervised classification models

| ID        | Cosine Similarity | Overlap Coefficient | Cheating Level (a) |
|-----------|-------------------|---------------------|--------------------|
| (1,1,25)  | 0.817540505       | 0.803810564         | High               |
| (1,1,30)  | 0.771951927       | 0.780808721         | High               |
| (1,2,29)  | 0.71391944        | 0.68778898          | High               |
| (1,25,30) | 0.69305863        | 0.690211565         | High               |
| (1,13,24) | 0.501569582       | 0.457819896         | High               |
| (1,13,23) | 0.475967316       | 0.395380058         | High               |
| (1,23,24) | 0.384379516       | 0.384832118         | High               |
| (1,12,22) | 0.348145528       | 0.358832118         | Intermediary       |
| (1,1,27)  | 0.262484497       | 0.244609219         | Low                |
| (1,25,27) | 0.25834858        | 0.244129913         | Low                |
| (1,13,8)  | 0.240827327       | 0.151298724         | None               |
| (1,27,30) | 0.239963138       | 0.245380561         | Low                |
| (1,1,26)  | 0.21861438        | 0.301880606         | Intermediary       |
| (1,26,30) | 0.19805587        | 0.29802218          | Low                |
| (1,11,17) | 0.187388827       | 0.198169741         | None               |
| (1,11,21) | 0.181043601       | 0.231319903         | Low                |
| (1,24,8)  | 0.166791662       | 0.118873948         | None               |
| (1,20,24) | 0.160367724       | 0.182427147         | None               |

a According to the human specialist.
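The number of comparisons reported in this section can be checked by enumerating the (Q, X, Y) identifiers directly. The sketch below is only illustrative; the function name is hypothetical, and it assumes that students are numbered from 1 to 30 and questions from 1 to 4.

```python
from itertools import combinations
from math import comb

def comparison_ids(n_questions, n_students):
    """All (question, student X, student Y) triples with X < Y."""
    return [
        (q, x, y)
        for q in range(1, n_questions + 1)
        for x, y in combinations(range(1, n_students + 1), 2)
    ]

ids = comparison_ids(n_questions=4, n_students=30)
total = len(ids)  # one similarity computation per triple
```

Each triple corresponds to one row of the data sheet excerpted in Table 4, so the full sheet has `total` rows.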


5. Results 

To simplify the visualization of cheating on exams, consider a graph G = (V, A), where
V is the set of exams (identified by the student code) and A is the set of edges that
link exams whose answers to a question have a similarity higher than a threshold $\gamma$. For
each value of $\gamma$ there is a unique similarity graph. With a suitable adjustment of $\gamma$, it
is possible to obtain similarity graphs for each level of cheating.
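The construction of such a similarity graph can be sketched in Python as follows. The similarity values are overlap coefficients for question 1 taken from Table 4 and rounded to four decimal places; the function name and the threshold value 0.35 are hypothetical and chosen only for illustration.

```python
def similarity_edges(similarities, threshold):
    """Return the student pairs whose similarity exceeds the threshold."""
    return sorted(
        pair for pair, value in similarities.items() if value > threshold
    )

# Overlap coefficients for question 1 (subset of Table 4, rounded).
question_1 = {
    (1, 25): 0.8038,
    (1, 30): 0.7808,
    (25, 30): 0.6902,
    (1, 27): 0.2446,
    (25, 27): 0.2441,
    (27, 30): 0.2454,
}

# Illustrative threshold, not a value reported in the paper.
edges = similarity_edges(question_1, threshold=0.35)
```

With this threshold only the pairs among students 1, 25, and 30 remain connected, forming a triangle in the graph; lowering the threshold would also connect student 27.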
Table 5

Cheating on exams done by students according to the human specialist

Columns: ID, followed by L (a), I (b), and H (c) for each of Question 1, Question 2, Question 3, and Question 4.

a Low, b Intermediary, and c High cheating.


Figure 4(a) is a circular graph illustrating the pairs of answers to the first question that
had the highest overlap coefficient values, i.e., the students who probably cheated on this
question. The circular graphs for the remaining questions are presented in Figs. 4(b),
4(c) and 4(d). Another form of visualization is shown in Fig. 5, where the most similar
answers are placed near each other⁴. This form helps a teacher to quickly discover the
students who answered the question in a similar manner.

5.1. Supervised Classification Models 

A Decision Tree (DT) algorithm was employed in order to build models able to detect
and evaluate cheating on school exams. DT is considered one of the most widespread
and consolidated supervised classification algorithms (Larose, 2004).

We used a DT algorithm similar to the C4.5 algorithm (Quinlan, 1993). The maximum
tree depth was set to 4, which corresponds to the number of classes (high, intermediary,
low and none), and the confidence level for pessimistic pruning was set to 0.25. The
cosine similarity and the overlap coefficient between all pairs of answers were each used as
the single input attribute of a separate model, resulting in two classification models of cheating.

4 The graph was drawn according to Peter Eades’ method for drawing undirected graphs (Eades et al., 2010).

Fig. 4. Similarity graphs for all exam questions.


The validation of the DT models was done through stratified ten-fold cross-validation,
which is a standard statistical technique for validating a learning
algorithm (Larose, 2004). In this technique, the data is divided randomly
into 10 parts in which each class is represented in approximately the same proportion
(stratified sampling). Each part is used in turn as a holdout set while the other nine
parts are used to train the model, for a total of ten train/test combinations. For each one,
the error rate is calculated on the holdout set, and thus the learning procedure is executed
10 times using different training sets. Finally, the 10 error estimates are averaged to
yield an overall error estimate.
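The splitting scheme described above can be sketched in plain Python. This is not the RapidMiner implementation used in the study; the function names, the fixed random seed, and the toy labels are assumptions made only for illustration.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k=10, seed=0):
    """Split item indices into k folds with similar class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    position = 0
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the items of each class round-robin across the folds.
        for index in indices:
            folds[position % k].append(index)
            position += 1
    return folds

def train_test_splits(labels, k=10, seed=0):
    """Yield (train indices, holdout indices), one pair per fold."""
    folds = stratified_folds(labels, k, seed)
    for i, holdout in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, holdout

# Toy labels: 40 pairs without cheating and 20 with high cheating.
labels = ["none"] * 40 + ["high"] * 20
folds = stratified_folds(labels)
splits = list(train_test_splits(labels))
```

A model would be trained on each `train` list and its error rate measured on the corresponding `holdout` list; averaging the ten error rates gives the overall estimate.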

The cheating percentages defined by the specialist and the two DT models are
presented in Fig. 6. The cosine-based DT model matched the real intermediary cheating
percentage (i.e., 19%), but its low and high cheating percentages were far from the
specialist’s model (17%/64% and 26%/55%). On the other hand, the overlap-based DT cheating
percentages were closer to the specialist’s model. Thus, for this first comparison, the
overlap-based DT model showed results closer to the specialist’s conclusion.

Fig. 5. Groups of similar answers for the first question of the exam.

The decision tree model based on the cosine similarity is shown in Fig. 7. Let 
Cosine(Q, X, Y) denote the cosine similarity between the answers to question Q 
provided by students X and Y. According to this decision tree, if Cosine(Q, X, Y) > 0.358, 
then the model classifies the cheating as high. If 0.288 < Cosine(Q, X, Y) ≤ 0.358, then 
it is an intermediary cheating. Naturally, this model can assign wrong cheating levels. 
These errors are reported in the confusion matrix (Table 6). The precision for 
detecting high cheating was 92.59%, but only 37.50% for detecting intermediary cheating. 
In addition, the model presented low recall values for low and intermediary cheating. In 
short, this model was good at detecting cheating but only reasonable at evaluating its 
degree. 
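
As a check on how the precision and recall figures follow from the confusion matrix, the sketch below recomputes them from the counts in Table 6, reading rows as predicted levels and columns as the specialist's (true) levels. The names `COSINE_MATRIX` and `precision_recall` are illustrative and do not come from the original study.

```python
LEVELS = ["High", "Intermediary", "Low", "None"]

# Counts from Table 6: rows are predictions, columns are true levels.
COSINE_MATRIX = [
    [25, 2, 0, 0],
    [0, 3, 3, 2],
    [1, 2, 3, 1],
    [0, 2, 6, 1690],
]


def precision_recall(matrix):
    """Return per-class precision and recall lists for a square matrix."""
    size = len(matrix)
    precision, recall = [], []
    for i in range(size):
        predicted_total = sum(matrix[i])                         # row sum
        true_total = sum(matrix[r][i] for r in range(size))      # column sum
        precision.append(matrix[i][i] / predicted_total)
        recall.append(matrix[i][i] / true_total)
    return precision, recall


precision, recall = precision_recall(COSINE_MATRIX)
for level, p, r in zip(LEVELS, precision, recall):
    print(f"{level:<13} precision={p:.2%} recall={r:.2%}")
```

Precision divides each diagonal count by its row total (all pairs predicted at that level), while recall divides it by its column total (all pairs the specialist placed at that level).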

The other DT model (Fig. 8), based on the overlap coefficient, matched or improved on 
the cosine model for every quality metric (Table 7). There was only one false positive: 
a pair with no cheating that the model classified as low cheating. The major improvements 
over the cosine model occurred in the precision for intermediary and low cheating (66.67% 
and 69.23% versus 37.50% and 42.86%) and in the recall of low cheating (75.00% versus 25.00%). 

A comparison between the two decision tree models is given in Table 8. The DT 
overlap model achieved better results for both accuracy and the Kappa index⁵, as well as a 
lower standard deviation for these quality metrics. The table also shows a 99% confidence 
interval for the accuracy and a 95% confidence interval for the Kappa index. 
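
To make the two summary metrics concrete, the following sketch recomputes accuracy and a Kappa index from the pooled confusion matrices of Tables 6 and 7. It assumes the usual Cohen-style formula (observed agreement corrected for chance agreement); the paper cites Fleiss et al. for its calculation, so small differences from Table 8 are possible. The names used here are illustrative.

```python
# Counts from Tables 6 and 7: rows are predictions, columns are true levels.
COSINE_MATRIX = [[25, 2, 0, 0], [0, 3, 3, 2], [1, 2, 3, 1], [0, 2, 6, 1690]]
OVERLAP_MATRIX = [[25, 2, 0, 0], [1, 4, 1, 0], [0, 3, 9, 1], [0, 0, 2, 1692]]


def accuracy_and_kappa(matrix):
    """Return (accuracy, kappa) for a square confusion matrix."""
    size = len(matrix)
    total = sum(sum(row) for row in matrix)
    # Observed agreement: share of pairs on the diagonal.
    observed = sum(matrix[i][i] for i in range(size)) / total
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(matrix[r][c] for r in range(size)) for c in range(size)]
    # Agreement expected by chance from the marginal totals.
    expected = sum(row_totals[i] * col_totals[i] for i in range(size)) / total ** 2
    return observed, (observed - expected) / (1 - expected)


for name, matrix in (("Cosine", COSINE_MATRIX), ("Overlap", OVERLAP_MATRIX)):
    acc, kappa = accuracy_and_kappa(matrix)
    print(f"{name}: accuracy={acc:.2%} kappa={kappa:.4f}")
```

Under these assumptions the overlap model comes out at about 99.43% accuracy and a Kappa of about 0.8904, in line with Table 8. The cosine model gives about 98.91% and a Kappa of roughly 0.7825, which is the midpoint of the interval reported in Table 8 and slightly below the printed mean of 0.785; this is an observation about the sketch, not a correction of the published value.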

⁵ The Kappa index was calculated according to Fleiss et al. (1969, 2003). 

Fig. 6. Percentages of cheating related to the classification models and the specialist. 

Table 6 
Decision tree confusion matrix when using cosine similarity as unique attribute 

                              True 
Prediction      High     Intermediary   Low      None     Precision 
High            25       2              0        0        92.59% 
Intermediary    0        3              3        2        37.50% 
Low             1        2              3        1        42.86% 
None            0        2              6        1690     99.53% 
Recall          96.15%   33.33%         25.00%   99.82% 

Fig. 7. Decision tree classification model based on the cosine similarity value between two exam answers. 

Table 7 
Decision tree confusion matrix when using overlap coefficient as unique attribute 

                              True 
Prediction      High     Intermediary   Low      None     Precision 
High            25       2              0        0        92.59% 
Intermediary    1        4              1        0        66.67% 
Low             0        3              9        1        69.23% 
None            0        0              2        1692     99.88% 
Recall          96.15%   44.44%         75.00%   99.94% 

Table 8 
Comparison between the decision tree classification models 

            Accuracy                               Kappa 
DT model    mean     std.    .99 Conf. Int.        mean     std. error   .95 Conf. Int. 
Cosine      98.91%   0.60%   98.55%   98.90%       0.785    0.0443       0.6957   0.8693 
Overlap     99.43%   0.36%   99.05%   99.43%       0.8904   0.0318       0.828    0.9528 

We defined a hypothesis test to check whether the classifier model based on the 
overlap metric had a higher agreement with the reference model (i.e., the specialist) 
than the model based on the cosine similarity. To this end, we consider (8) as the null 
hypothesis and (9) as our research hypothesis. We denote by $\kappa_1$ the Kappa index 
of the model based on the cosine similarity and by $\kappa_2$ the Kappa index of the 
model using the overlap coefficient. The hypothesis test is stated at the 95% confidence 
level. 


$$H_0 : \kappa_1 - \kappa_2 = 0, \tag{8}$$

$$H_1 : \kappa_1 - \kappa_2 < 0. \tag{9}$$


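Therefore, we evaluate (10) to find the p-value associated with the hypothesis test: 

$$z = \frac{\kappa_1 - \kappa_2}{\sqrt{\mathrm{Var}(\kappa_1) + \mathrm{Var}(\kappa_2)}} = \frac{0.785 - 0.8904}{\sqrt{0.00196 + 0.00101}} \approx -1.93 \quad (p = 0.027). \tag{10}$$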

We rejected the null hypothesis at the 5% significance level, meaning that the 
agreement between the overlap-based DT and the reference model is higher than the 
agreement between the cosine-based DT and the reference model. 
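
The arithmetic behind the test in (10) can be reproduced with the standard library alone. The sketch assumes a one-sided z-test whose denominator is the square root of the sum of the two variances (the squared standard errors from Table 8), which is the reading that reproduces the reported p-value; the function name `one_sided_kappa_test` is illustrative.

```python
import math


def one_sided_kappa_test(kappa_1, se_1, kappa_2, se_2):
    """Return (z, p) for H1: kappa_1 - kappa_2 < 0, assuming independence."""
    z = (kappa_1 - kappa_2) / math.sqrt(se_1 ** 2 + se_2 ** 2)
    # Lower-tail probability of the standard normal distribution, P(Z <= z).
    p_value = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return z, p_value


# Kappa means and standard errors as printed in Table 8.
z, p_value = one_sided_kappa_test(0.785, 0.0443, 0.8904, 0.0318)
print(f"z = {z:.2f}, p = {p_value:.3f}")
```

With the Table 8 values this gives z of about -1.93 and a one-sided p-value of about 0.027, which is below 0.05 and therefore consistent with rejecting the null hypothesis.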


6. Discussion 

Besides the aforementioned points, we proposed and compared two possible 
classification models for cheating detection using the decision tree supervised algorithm: 
one based on the cosine similarity, and the other based on the overlap coefficient. The 
latter presented better results, achieving an accuracy of 99.43% ± 0.36% and an agreement 
level (Kappa index) of 0.89 ± 0.032 in comparison with the specialist’s result. This 
suggests an excellent inference quality in the detection and evaluation of cheating 
(Landis and Koch, 1977). 

The decision tree depicted in Fig. 8 can be used as a kind of oracle for cheating 
detection, without the need for the teacher to detect cheating manually. After the 
preprocessing and transformation steps (Sections 4.1 and 4.2), the only remaining task is 
to compute the overlap coefficient for all pairs of exam answers. All these steps can be 
done automatically using, for example, the RapidMiner tool. After that, one can directly 
use the decision rules provided by the decision tree (Fig. 8). Considering that A and B are 
the answers to the same question provided by two different students, we have:⁶ 

(i) overlap(A, B) ≤ 0.22: no cheating, 

(ii) 0.22 < overlap(A, B) ≤ 0.30: low cheating, 

(iii) 0.30 < overlap(A, B) ≤ 0.38: intermediary cheating, 

(iv) overlap(A, B) > 0.38: high cheating. 
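
The four rules above translate directly into a small function. The sketch below is illustrative only: `cheating_level` encodes the thresholds as read from the rules, and `overlap_coefficient` assumes the common set-based definition (shared terms divided by the size of the smaller term set) applied to already preprocessed tokens, which may differ in detail from the definition and preprocessing used in the study.

```python
def overlap_coefficient(tokens_a, tokens_b):
    """Shared distinct tokens divided by the size of the smaller token set."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    if not set_a or not set_b:
        return 0.0  # assumption: an empty answer shares nothing
    return len(set_a & set_b) / min(len(set_a), len(set_b))


def cheating_level(overlap):
    """Map an overlap value to a cheating level using rules (i) to (iv)."""
    if overlap <= 0.22:
        return "none"
    if overlap <= 0.30:
        return "low"
    if overlap <= 0.38:
        return "intermediary"
    return "high"


# Hypothetical token lists standing in for two preprocessed answers.
answer_a = ["text", "mining", "finds", "patterns"]
answer_b = ["text", "mining", "classification"]
value = overlap_coefficient(answer_a, answer_b)
print(f"overlap = {value:.2f}, level = {cheating_level(value)}")
```

Because the rules use a single attribute with ordered thresholds, a chain of comparisons is enough; no tree library is needed to apply the model once the overlap values are available.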

However, it is also important to mention that we cannot affirm that the cheating detection 
model can be used for any kind of exam (e.g., a mix of closed-ended and open-ended 
questions) or for any kind of course (e.g., mathematics or physics). The results 
presented in this paper are valid only under similar exam conditions 
(i.e., only open-ended questions). 


7. Conclusions 

The first point to mention is that a successful case study was carried out on the use 
of data mining methodology and algorithms to help teachers deal with an old 
educational problem: academic dishonesty (cheating) on exams. It is also noteworthy 
that only open source (i.e., free) programs were used for all data mining tasks. 
Thus, anyone can take advantage of this work and repeat the methodology for 
their own purposes at no additional financial cost. 

In this paper we have detailed a potential application of text mining in the education 
domain. Specifically, it is shown that text mining can be used satisfactorily to develop 
a mechanism for the detection and evaluation of cheating on exams based on open-ended 
questions. The solution presented in this paper also fits the need for cheating detection in 
other written forms of student assessment (e.g., homework). 

The solution presented in this paper can assist a teacher in the difficult and 
labor-intensive task of detecting and evaluating cheating on exams. As future work, we intend 
to run more experiments with academic exams in other areas of study. In addition, we 
intend to consider the physical distribution of students in the classroom as an input to the 
cheating classification model. 
Detection and Evaluation of Cheating on Exams Using Supervised Classification 

Elmano Ramalho CAVALCANTI, Carlos Eduardo PIRES, 

Elmano Pontes CAVALCANTI, Vladia Freire PIRES 

Text mining has been used for various purposes, such as document classification and 
the extraction of domain-specific information from text. In this paper the authors present 
a study in which text mining methodology and algorithms were thoroughly applied to detect 
and evaluate academic dishonesty (cheating) on open-ended college exams, based on document 
classification. First, two classification models for cheating detection are proposed, using 
a supervised decision tree algorithm. The results of both classifiers are compared with the 
conclusions provided by a domain expert. One of the classifiers proved to detect and evaluate 
cheating on exams very well, so it can be used in real school and college practice. 



